Intelligent Land-Vehicle Model Transfer Trajectory Planning Method Based on Deep Reinforcement Learning

To address the problem of model error and tracking dependence in the process of intelligent vehicle motion planning, an intelligent vehicle model transfer trajectory planning method based on deep reinforcement learning is proposed, which is able to obtain an effective control action sequence directly. Firstly, an abstract model of the real environment is extracted. On this basis, a deep deterministic policy gradient (DDPG) and a vehicle dynamic model are adopted to jointly train a reinforcement learning model, and to decide the optimal intelligent driving maneuver. Secondly, the actual scene is transferred to an equivalent virtual abstract scene using a transfer model. Furthermore, the control action and trajectory sequences are calculated according to the trained deep reinforcement learning model. Thirdly, the optimal trajectory sequence is selected according to an evaluation function in the real environment. Finally, the results demonstrate that the proposed method can deal with the problem of intelligent vehicle trajectory planning for continuous input and continuous output. The model transfer method improves the model’s generalization performance. Compared with traditional trajectory planning, the proposed method outputs continuous rotation-angle control sequences. Moreover, the lateral control errors are also reduced.


Introduction
Although intelligent driving technology is developing rapidly, some new problems are emerging during development. In 2016, the first major accident of Tesla happened in the field of automatic driving. Meanwhile, Uber suffered an incident of automation driving hitting pedestrians on 28 March 2018. These problems greatly aroused worldwide attention on the safety of intelligent driving. Therefore, there is still a long way for intelligent driving to improve its innovative and stable safety. As the key to its technology, trajectory planning technology is attracting more and more attention and exploration by researchers at home and abroad.
Trajectory planning is not only applied to intelligent vehicles, but also widely used in the field of robotics and unmanned aerial vehicles [1,2]. There are various ways of trajectory generation in trajectory planning, including the Nelson polynomial, spiral curve equation, spline curve, Bezier curve, etc. [3]. For example, a fourth-order polynomial and dynamic bicycle model were utilized to describe a vehicles

Reinforcement Learning Method
The basic principle of reinforcement learning is presented in Figure 1. When the agent is required to achieve a task, it first interacts with environment (Env) via action (a); then, the impact of the action on the environment brings the agent into a new state (s). At the same time, the agent receives reward feedback (Reward) from the environment. The agent and environment generate a large amount of data through a continuous loop and interaction. Reinforcement learning utilizes these sample data to adjust the strategy . Afterward, it interacts with Env to enter a new state, generating new 1 ( , , , ) t t t t data s a r s   . Subsequently, the new samples are adopted to modify the strategy  for several iterations. After a great deal of iterative learning, the agent finally learns the optimal strategy *  to complete the corresponding task. Strategy π refers to the agent's selection of action a under state s, which is the key problem in reinforcement learning. Strategy π is a map from the agent, which is aware of environmental state s to action a. The random strategy selects the corresponding action according to the probability Cumulative rewards or returns are calculated when a strategy π is given. The definition of cumulative returns is as follows: where 01   is the discount factor of long-term income. The state function is defined as the cumulative return benefit corresponding to state s under a strategy π: The corresponding state-action value function is defined as Strategy π refers to the agent's selection of action a under state s, which is the key problem in reinforcement learning. Strategy π is a map from the agent, which is aware of environmental state s to action a. The random strategy selects the corresponding action according to the probability π(a|s) of each action, while deterministic policy selects action a = π(s) directly according to s.
stochastic Policy : ∑ π(a|s) = 1 deterministic Policy : π(s) : Cumulative rewards or returns are calculated when a strategy π is given. The definition of cumulative returns is as follows: where 0 < γ < 1 is the discount factor of long-term income. The state function is defined as the cumulative return benefit corresponding to state s under a strategy π: The corresponding state-action value function is defined as

Virtual Driving Environment Model Design
According to the description of the intelligent vehicle driving scene, intelligent driving behavior decision tasks include normal driving, changing/overtaking, curve/ramp driving, and so on. Here, the environment model was built as a circular map with three lanes, as shown in Figure 2. Specifically, the green area and outer lane boundary are deemed insurmountable obstacles, while the others are free travel space. The light-blue lines indicate the desired path with rewards, path d = (X d , Y d , φ d ).
The intelligent vehicle, Car = (x c , y c , ϕ c , v), drives in a circular map and learns intelligent driving maneuvers, including straight, changing, and curving driving behavior. Lastly, x c , y c , ϕ c represents the current position and posture information of the intelligent vehicle.

Virtual Driving Environment Model Design
According to the description of the intelligent vehicle driving scene, intelligent driving behavior decision tasks include normal driving, changing/overtaking, curve/ramp driving, and so on. Here, the environment model was built as a circular map with three lanes, as shown in Figure 2  To simplify the environment model Env, the intelligent vehicle has n ranging beams Sensor. Furthermore

& & _=
True if x y get money False other else . The optimal intelligent driving maneuver for intelligent vehicle learning is  ; therefore, the strategy space of the model is To simplify the environment model Env, the intelligent vehicle has n ranging beams Sensor. Furthermore, the farthest distance d Max of each ranging beam is the same. Each ranging beam supports feedback information Sensor i = (d i , x end , y end ) to the intelligent vehicle when it encounters obstacles. Here, d i is the length that the ranging beam is blocked by obstacles or boundaries, and x end , y end are the position coordinates of the beam in contact with obstacles or boundaries. The intelligent vehicle speed keeps a constant v, and the angle control output is −δ min ≤ δ t ≤ δ max . Thus, the key return function Reward for the environment model Env is as follows: Reward = −1 when contacts with obstacles or boundaries R action + R money else , where λ 1 is the positive penalty coefficient, and R action represents the difference penalty between the front and rear successive front wheel angles δ old , δ of an intelligent vehicle. The smaller the change is between successive actions, the smaller the penalty. R money represents the reward for an intelligent vehicle driving on the desired path, (∆x, ∆y, ∆ϕ) is the difference between the current posture (x c , y c , ϕ c ) and the desired path path d = (X d , Y d , φ d ), and ε 1 , ε 2 , ε 3 is the fault-tolerant error. Intelligent vehicles randomize their initial position (x 0 , y 0 , ϕ 0 ) according to the given policies π reset to ensure a more adequate exploration of the environment and the stability of the result. Intelligent vehicle termination conditions at each epoch include 1) contact with obstacles or boundaries; 2) meeting the maximum number of driving steps, n step = Num max . The optimal intelligent driving maneuver for intelligent vehicle learning is π; therefore, the strategy space of the model is π all = {π reset , π}.
State space is assumed as ∑(Sensor) = {d 0 , d 1 , . . . , d n }, and motion space is assumed as ∑(δ) = {δ new }. For the intelligent vehicle and environment model Env, an abstract model M is constructed.

DDPG Network Structure and Algorithm Flow
For complex continuous state space ∑(Sensor) and continuous action space ∑(δ), it is necessary to train the deep reinforcement learning model M θ in a virtual environment M using the DDPG algorithm. The DDPG algorithm consists of an Actor policy network and a Critic evaluation network, as shown in Figure 3. : [ , ] [ , ]

DDPG Network Structure and Algorithm Flow
For complex continuous state space () Sensor  and continuous action space ()   , it is necessary to train the deep reinforcement learning model M  in a virtual environment M using the DDPG algorithm. The DDPG algorithm consists of an Actor policy network and a Critic evaluation network, as shown in Figure 3.
They are adopted as the input to the Actor policy network; therefore, the number of Actor policy network input-layer neurons is 11. Meanwhile, the policy network's hidden layer utilizes three full connected networks; each layer contains 512 neurons. The fully connected layer is followed by batch normalization (BN), before the ReLU (a type of activation function) is adopted as the activation function. At the same time, the last layer of the network chooses tanh as the activation function to map the network output between the interval [-1, 1]. The network output is action Meanwhile, the policy network's hidden layer utilizes three full connected networks; each layer contains 512 neurons. The fully connected layer is followed by BN. Although the hidden layer of the evaluation network and the policy network hold the same structure, its last layer is activated by a linear function such as that in Equation (9). Thus, the network output is the corresponding Q-value,  The state s sensor ∈ ∑(Sensor), speed v, and action δ old of the final moment are combined as s a = (s sensor , v, δ old ). They are adopted as the input to the Actor policy network; therefore, the number of Actor policy network input-layer neurons is 11. Meanwhile, the policy network's hidden layer utilizes three full connected networks; each layer contains 512 neurons. The fully connected layer is followed by batch normalization (BN), before the ReLU (a type of activation function) is adopted as the activation function. At the same time, the last layer of the network chooses tanh as the activation function to map the network output between the interval [−1, 1]. The network output is action δ ∈ ∑(δ). After the state s a and action δ are merged as s c = (s a , δ), then they become the input to the Critic evaluation network. The number of Critic policy network input-layer neurons is 12. Meanwhile, the policy network's hidden layer utilizes three full connected networks; each layer contains 512 neurons. The fully connected layer is followed by BN. Although the hidden layer of the evaluation network and the policy network hold the same structure, its last layer is activated by a linear function such as that in Equation (9). Thus, the network output is the corresponding Q-value, Q(s a , δ), of s c . where x is the input of the last layer, y is the predicted Q-value, and k, b are the weight and bias for network training. DDPG adopts the Actor-Critic framework, including the Actor and Critic structure. Here, the Actor part includes the online policy network and the target policy network, which adopt the deterministic policy to get a definite action from the current state. The Critic part includes an online Q network and a target Q network, in which the Bellman equation of the action-state function Q is utilized to measure the quality of action. The pseudo code of the DDPG algorithm is shown in Algorithm 1 and the DDPG algorithm flow is shown in Figure 4. The input state (d 1 , . . . , d 9 , v, δ old ) and output action δ are represented accordingly. The DDPG algorithm adopts a deterministic policy, and the policy output is an action. Therefore, it needs less sampled data to maximize efficiency. However, this results in the environment not being explored. In order to improve the algorithm's exploration ability, the OU stochastic process was added to the deterministic policy action. Furthermore, the environmental execution was carried out after sampling from a random process of the action. Due to its good correlation over a time series, the OU stochastic process is able to explore environments with momentum properties by the agent.

1.
Randomly initialize Critic online Q network parameters θ Q and Actor s online policy network parameters θ µ .

2.
Initialize Critic target Q network parameters θ Q ← θ Q and Actor s target policy network parameters θ µ ← θ µ .
Initialize the OU random process D for the exploration of action. 6.
Choose action a t based on current strategy µ(s t ) and exploring noise D t : a t = µ(s t ) + D t 9.
Perform the action a t , get the reward r t , and observe the new state s t+1 . 10.
Store the process (s t , a t , r t , s t+1 ) in R.

11.
Sampling from R to get the process (s i , a i , r i , s i+1 ) of batch N.

12.
Set y i = r i + γQ s i+1 , µ s i+1 θ µ θ Q // Q is the state-action value calculated by the target Q network, and µ is the current strategy obtained by the target policy network.

13.
Update Critic s online Q network by minimizing the loss function: Update the Actor s online policy network with sampling gradient:

15.
Update Critic s target Q network : Update Actor s target policy network : end for 18. end for

Model Transfer Strategy
The intelligent vehicle Car obtains the optimal intelligent driving strategy  from the where  is the difference between target heading angle tar

Model Transfer Strategy
The intelligent vehicle Car obtains the optimal intelligent driving strategy π from the environment model M through deep reinforcement learning training. The real environment M* is obviously different from the virtual environment model M; the former usually tends to be more complex and time-varying. Therefore, if the training model of virtual environment M is directly applied to real environment M*, it brings numerous predictable or unpredictable problems.
The decision tasks of intelligent vehicles include driving straight, changing lanes, crossing corners, ramp driving, etc. Thus, the model of intelligent driving tasks is abstracted from the real environment M* and migrated to the virtual environment M, which maps onto a location area corresponding to the ring map in M. Then, the optimal driving strategy π is adopted to plan the control-trajectory sequence C = {δ, ζ} to achieve the driving task, including the control sequence δ = {δ 1 , δ 2 , . . . , δ t } and its corresponding trajectory ζ = {p 1 , p 2 , . . . , p t }. Finally, the intelligent vehicles carry out the task δ to complete the driving task in the real environment M*.
According to a different sub-task Z, such as lane keeping, lane changing, and overtaking, the fixed reference points P re f Z = x re f , y re f , ϕ re f in M are set as a reference target or local tasks. In terms of the driving task's endpoint goal P tar = (x tar , y tar , ϕ tar ) of path planning and intelligent vehicle posture Car M * = x c|M * , y c|M * , ϕ c|M * in the real environment M*, the model transfer strategy is mapped to M using Equation (10). Finally, the intelligent parking posture is obtained as Car = (x c , y c , ϕ c ). : where θ is the difference between target heading angle ϕ tar of the vehicle's driving destination and heading angle ϕ tar of the reference point. In order to keep the heading angle of the target point P tar in the real environment coinciding with the heading angle of reference point P re f in the virtual environment, the real environment coordinate system is rotated by (x , y , ϕ ), which is the pose corresponding to the ego vehicle in the rotated coordinate system, and (x tar , y tar , ϕ tar ), which is the pose corresponding to the target point P in the rotated coordinate system. (∆x, ∆y, ∆ϕ) is the difference in pose between the ego vehicle and the target point in the rotated coordinate system, and (x c , y c , ϕ c ) is the corresponding pose of the ego vehicle in the virtual environment after the model transfer. From Figure 2, the virtual environment seems to contain only two curves, but this is inaccurate. After the training is completed, the vehicle can not only complete straight and turning tasks, but also complete lane-changing operations. Trajectories generated in the lane-changing phase contain different curvatures; thus, mapping ramps to the lane-changing phase in stages can solve the problem of ramp driving.
However, the virtual simulation environment does not consider turning around; thus, when the actual road curvature is very large, the proposed method is not applicable. Figure 5a-c show the model transfer process of a lane change. Figure 5a shows the real-world M* scene. The road condition information is composed of invariants and variables. The invariants include the number of lanes and the width of lanes. The variables are the information regarding obstacles, which is fed back to the distance beam of the intelligent vehicle. The initial planning process is to travel along the current driving lane. When it is detected that there is a vehicle with lower speed than the ego vehicle ahead of it in the current lane, the left lane is taken as a desired path. The driving task is to switch to the left lane based on behavioral decision planning. Figure 5d-f show the model transfer process of a ramp. Figure 5d shows the real-world M* scene. The green vehicle is the current vehicle position, x = [x, y, ϕ, v, ω] T , and the pose information is Car M * = (x, y, ϕ). The green dot in the center of left lane is the scattered point of path planning. Here, the red point is the destination of current driving task P tar = (x tar , y tar , ϕ tar ). From Figure 2, the virtual environment seems to contain only two curves, but this is inaccurate. After the training is completed, the vehicle can not only complete straight and turning tasks, but also complete lane-changing operations. Trajectories generated in the lane-changing phase contain different curvatures; thus, mapping ramps to the lane-changing phase in stages can solve the problem of ramp driving.
However, the virtual simulation environment does not consider turning around; thus, when the actual road curvature is very large, the proposed method is not applicable. Figure 5a-c show the model transfer process of a lane change. Figure 5a shows the real-world M* scene. The road condition information is composed of invariants and variables. The invariants include the number of lanes and the width of lanes. The variables are the information regarding obstacles, which is fed back to the distance beam of the intelligent vehicle. The initial planning process is to travel along the current driving lane. When it is detected that there is a vehicle with lower speed than the ego vehicle ahead of it in the current lane, the left lane is taken as a desired path. The driving task is to switch to the left lane based on behavioral decision planning.

Algorithm Framework of Model Transfer Based on Deep Reinforcement Learning
DRL-MTTP aims to abstract the complex real environment and transfer it to a simple virtual environment through the model. Furthermore, the optimal intelligent driving strategy is applied to the virtual environment, which is trained by the agent after deep reinforcement learning. Thus, the optimal trajectory control sequence is obtained to realize the end-to-end trajectory planning in the real environment. Figure 6 shows a technical diagram of DRL-MTTP.
The framework of DRL-MTTP is shown in Algorithm 2. Table 2. Firstly, sub-task  and pathplanning datasets are initialized according to upper data streams. Afterward, the trajectory-planning stage begins. Then, an appropriate target point target P is selected as the local goal based on the subtask  . Thus, the selection process is shown in Equation (11).
where  

Algorithm Framework of Model Transfer Based on Deep Reinforcement Learning
DRL-MTTP aims to abstract the complex real environment and transfer it to a simple virtual environment through the model. Furthermore, the optimal intelligent driving strategy is applied to the virtual environment, which is trained by the agent after deep reinforcement learning. Thus, the optimal trajectory control sequence is obtained to realize the end-to-end trajectory planning in the real environment. Figure 6 shows a technical diagram of DRL-MTTP.
The framework of DRL-MTTP is shown in Algorithm 2. Firstly, sub-task γ and path-planning datasets are initialized according to upper data streams. Afterward, the trajectory-planning stage begins. Then, an appropriate target point P target is selected as the local goal based on the sub-task γ. Thus, the selection process is shown in Equation (11).
where (x i , y i ) are the corresponding coordinates of P i , s t−1,t is the straight-line distance between P t−1 and P t , l is the arc length threshold of path scatter points which is determined by sub-task γ, and P target is the target point that satisfies the threshold requirement. Subsequently, the target set is obtained by adding noise to the target point P target , and ε x , ε y , ε ϕ satisfies a Gaussian distribution. Furthermore, the corresponding position Car of the intelligent vehicle in the virtual environment M is calculated by a model transfer for each target point P target . The status of the environment s and the status of the intelligent vehicle x are gained by observing its ranging light beam in M. During the planning time T, the deep reinforcement learning model M θ is utilized in each unit of time t to analyze state s and predict the action δ t . At the same time, the dynamic model is adopted to simulate the prediction action. Meanwhile, the environment state s and intelligent vehicle state x are updated, and the track ζ t is recorded. Finally, the control-trajectory sequence pair C = {δ, ζ} under real environment M* is acquired by model transfer.
The second stage is about optimal trajectory selection. In this stage, the evaluation function of each control-trajectory pair is calculated. At the same time, the collision probability of the trajectory ζ is also judged. The control-trajectory pair with minimum J and no collision trajectory are taken as the optimal trajectory.
where T is the termination time, ∆δ is the difference of continuous action, κ 1 , κ 2 is the weight coefficient, and h represents the rectangular area of the vehicle outline at the end of its trajectory. Finally, the result of planning is executed. The intelligent vehicle implements the first τ steps of control sequence δ, and the model reference control is utilized to enhance the robustness and to reduce the influence of system error by the model error. If the task is not terminated, planning action of the first two phases is repeated to realize the intelligent vehicle dynamic trajectory planning.
P target ← S 1 : {P 1 , P 2 , . . . , P m }; //select a target from a path planning scatter set S 1 based on a task 5.
for i = 0, N do 6. P target,i = P target + ξ noise ; //add noise to generate target sets if J < J min and no collision //select the control-trajectory pair with minimum J value and no collision as the optimal trajectory 20.
C optimal = C i ;
end for 24.
//stage of execute the planning result 25.
update terminal; //whether the task ends or not 27. end while where T is the termination time,   is the difference of continuous action, 12 ,  is the weight coefficient, and h represents the rectangular area of the vehicle outline at the end of its trajectory.
Finally, the result of planning is executed. The intelligent vehicle implements the first  steps of control sequence  , and the model reference control is utilized to enhance the robustness and to reduce the influence of system error by the model error. If the task is not terminated, planning action of the first two phases is repeated to realize the intelligent vehicle dynamic trajectory planning.

Deep Reinforcement Learning Model Training
In the virtual simulation environment Env, the circle map was 100 m long and 50 m wide. The driveway was 3.4 m wide. Here, the speed was v = 36 km/h, which was kept constant. The dimension of the ranging light beam was n = 9, and the farthest range was d Max = 20 m, while the largest output of angle was δ max = 0.3 rad. The desired path d was the centerline of middle lane. The errors were ε 1 = ε 2 = 0.1m, ε 3 = 0.5236, the continuous action penalty coefficient was λ 1 = 0.01, and the random initial position was x 0 ∈ [10, 980], y 0 ∈ [10, 50], ϕ 0 ∈ [−π/2, π/2]. The maximum step number was Num max = 600, and the step gap was 0.1 s. The software environment was a Linux operating system with 16 Gb of memory, and the graphics card was a GTX1080 Ti (NVIDIA, Santa Clara, CA, USA). The system took advantage of the deep learning framework TensorFlow.
The greater the learning rate is, the lower the effect of previous training being retained. Similarly, the greater the discount factor is, the more emphasis is placed on experience. The smaller the discount factor is, the more attention is paid to the current return. If the numbers of hidden layers and hidden layer neurons are too little, then the data cannot be fitted well. Conversely, if the numbers of hidden layers and hidden layer neurons are too large, this can easily lead to over-fitting. Therefore, a better network structure and network parameters were designed after several trials. The hyper parameters in the deep reinforcement learning model M θ were set as follows: the discount factor was γ = 0.9, the learning rates of the Actor and Critic networks were both 10 −4 , the optimization method of Adam [32] was adopted, the soft update rate was τ = 0.001, the number of hidden layer neurons was 512, the size of the experience replay pool was 10 4 , the size of the batch was 64, the error was generated by a Gaussian process, the initial variance was var max = 2, the minimum variance was var min = 0.01, and the attenuation rate was 10 −4 .
At low speed, the vehicle dynamic model can be approximated as a bicycle model with two degrees of freedom [4]. The vehicle dynamics model is described in Equation (13).
where (x, y) is the location of the vehicle, ϕ is the yaw angle, ω is the yaw rate, δ is the front-wheel steering angle, U is the longitudinal velocity, and v is the lateral velocity. The definitions and values of the vehicle parameters in Equation (13) are shown in Table 1.
The vehicle status is represented by x = [x, y, ϕ, v, ω] T , and thus, Equation (13) can be expressed as Equation (14).  Figure 7 shows the changes in parameters during training. Figure 7a has an abscissa of "step" and an ordinate of "action". Figure 7b has an abscissa of "epoch" and an ordinate of "cumulative reward". Figure 7c has an abscissa of "step" and an ordinate of "average Q-value". Figure 7d has an abscissa of "step" and an ordinate of "gradient". Figure 7e has an abscissa of "epoch" and an ordinate of "noise". Figure 7f has an abscissa of "step" and an ordinate of "loss".
As shown in Figure 7, the rewards and the average Q-value of the agent gradually increased with the number of iterations during the training process. Finally, it tended to be stable. On the other hand, the loss gradually decreased to 0 with the increase in the number of iterations, indicating the evaluation of the network becoming more and more effective. The noise decreased as the number of iterations increased, providing sufficient exploration capability in the early stages and sufficient exploitation capability in the late stages. Figure 7 shows that the model learned from experience, and continuously approached the optimal strategy. At low speed, the vehicle dynamic model can be approximated as a bicycle model with two degrees of freedom [4]. The vehicle dynamics model is described in Equation (13). 2 2 cos sin sin cos where (x, y) is the location of the vehicle, ϕ is the yaw angle, ω is the yaw rate, δ is the frontwheel steering angle, U is the longitudinal velocity, and v is the lateral velocity. The definitions and values of the vehicle parameters in Equation (13) (14).
( , ) Cornering stiffness of front wheel /N·rad -1 f C 6500 Cornering stiffness of rear wheel /N·rad -1 r C 5200 Figure 7 shows the changes in parameters during training. Figure 7a has an abscissa of "step" and an ordinate of "action". Figure 7b has an abscissa of "epoch" and an ordinate of "cumulative reward". Figure 7c has an abscissa of "step" and an ordinate of "average Q-value". Figure 7d has an abscissa of "step" and an ordinate of "gradient". Figure 7e has an abscissa of "epoch" and an ordinate of "noise". Figure 7f has an abscissa of "step" and an ordinate of "loss".

Curve Track Planning
During the simulation test, the intelligent vehicle set out with different yaw angles, which are shown in Figure 9.  Figure 9i, the lateral error was nearly 0 when going straight. When the vehicle turned, the lateral error increased to about 0.2 m. After turning, the lateral error decreased to nearly 0. As shown in Figure 9j, as the initial heading angle was 0.5 rad, and the absolute value of the vehicle lateral error increased first before decreasing to nearly 0. The following trend was the same as that in Figure 9i.
Although there was a shock in the curve, the intelligent vehicle still ultimately navigated the curve successfully. Meanwhile, the curve of the control sequence and the change rate of the yaw angle were relatively smooth.

Curve Track Planning
During the simulation test, the intelligent vehicle set out with different yaw angles, which are shown in Figure 9.  Figure 9i, the lateral error was nearly 0 when going straight. When the vehicle turned, the lateral error increased to about 0.2 m. After turning, the lateral error decreased to nearly 0. As shown in Figure 9j, as the initial heading angle was 0.5 rad, and the absolute value of the vehicle lateral error increased first before decreasing to nearly 0. The following trend was the same as that in Figure 9i.
Although there was a shock in the curve, the intelligent vehicle still ultimately navigated the curve successfully. Meanwhile, the curve of the control sequence and the change rate of the yaw angle were relatively smooth.

Curve Track Planning
During the simulation test, the intelligent vehicle set out with different yaw angles, which are shown in Figure 9.  Figure 9i, the lateral error was nearly 0 when going straight. When the vehicle turned, the lateral error increased to about 0.2 m. After turning, the lateral error decreased to nearly 0. As shown in Figure 9j, as the initial heading angle was 0.5 rad, and the absolute value of the vehicle lateral error increased first before decreasing to nearly 0. The following trend was the same as that in Figure 9i.
Although there was a shock in the curve, the intelligent vehicle still ultimately navigated the curve successfully. Meanwhile, the curve of the control sequence and the change rate of the yaw angle were relatively smooth.

Experimental Comparison and Analysis of the Three Trajectory Planning Methods
Currently, there are three trajectory-planning methods comparisons and experimental analyses, as described in this section. These trajectory-planning methods are called the optimal trajectoryplanning method based on a cubic polynomial [33], the end-to-end trajectory-planning method [9,10], and the model transfer trajectory-planning method based on deep reinforcement learning. The intelligent vehicle drove following the planning results in each simulation step with the same tracking control error. Here, the experiments only compare and analyze the performance of the planning. The traditional trajectory-planning method adopts the angle control sequence based on a preview window. However, the end-to-end trajectory planning method outputs the angle control sequence directly. The period of the trajectory planning was 100 ms. In other words, the intelligent vehicle was reprogrammed after keeping the same deflection angle for 100 ms.

Arc Straight Track Scene
The arc shown in Figure 10 had a radius of 300 m; thus, it was large enough to be considered as a straight line in a small range. However, the curvature was not equal to zero. The contrast experiment is shown in Figure 11. Figure 11a-c are the experimental results based on the cubicpolynomial dynamic optimal trajectory-planning method. Figure 11d-f are the experimental results of the end-to-end trajectory-planning method. Figure 11g-i are the experimental results of the model transfer trajectory-planning method based on deep reinforcement learning. Figure 11j shows the lateral error from the center of the lane at each time point. Figure 11a,d,g have abscissas of "X/m" and ordinates of "Y/m". The blue dotted lines in the figures represent the "expected path" and the red solid lines represent the "actual trajectory". Figure 11b,e,h have abscissas of "time/0.1 s" and ordinates of "heading angle/rad". Figure 11c,f,i have abscissas of "time/0.1 s" and ordinates of "steering-wheel angle/°". Figure 11j has an abscissa of "time point" and an ordinate of "lateral error/m".
The actual trajectories, represented by the solid line, in the three methods were basically the same as the expected trajectory, represented by the dotted line. However, the rotation-angle control sequence of the first two methods oscillated continuously in the small range. The control sequence of our proposed method showed a periodic and continuous variation. The experimental results verified that the model transfer trajectory-planning method based on deep reinforcement learning performs

Experimental Comparison and Analysis of the Three Trajectory Planning Methods
Currently, there are three trajectory-planning methods comparisons and experimental analyses, as described in this section. These trajectory-planning methods are called the optimal trajectory-planning method based on a cubic polynomial [33], the end-to-end trajectory-planning method [9,10], and the model transfer trajectory-planning method based on deep reinforcement learning. The intelligent vehicle drove following the planning results in each simulation step with the same tracking control error. Here, the experiments only compare and analyze the performance of the planning. The traditional trajectory-planning method adopts the angle control sequence based on a preview window. However, the end-to-end trajectory planning method outputs the angle control sequence directly. The period of the trajectory planning was 100 ms. In other words, the intelligent vehicle was reprogrammed after keeping the same deflection angle for 100 ms.

Arc Straight Track Scene
The arc shown in Figure 10 had a radius of 300 m; thus, it was large enough to be considered as a straight line in a small range. However, the curvature was not equal to zero. The contrast experiment is shown in Figure 11. Figure 11a-c are the experimental results based on the cubic-polynomial dynamic optimal trajectory-planning method. Figure 11d-f are the experimental results of the end-to-end trajectory-planning method. Figure 11g-i are the experimental results of the model transfer trajectory-planning method based on deep reinforcement learning. Figure 11j shows the lateral error from the center of the lane at each time point. Figure 11a,d,g have abscissas of "X/m" and ordinates of "Y/m". The blue dotted lines in the figures represent the "expected path" and the red solid lines represent the "actual trajectory". Figure 11b,e,h have abscissas of "time/0.1 s" and ordinates of "heading angle/rad". Figure 11c,f,i have abscissas of "time/0.1 s" and ordinates of "steering-wheel angle/ • ". Figure 11j has an abscissa of "time point" and an ordinate of "lateral error/m".
The actual trajectories, represented by the solid line, in the three methods were basically the same as the expected trajectory, represented by the dotted line. However, the rotation-angle control sequence of the first two methods oscillated continuously in the small range. The control sequence of our proposed method showed a periodic and continuous variation. The experimental results verified that the model transfer trajectory-planning method based on deep reinforcement learning performs well, and the control action is continuous when driving nearly straight. As shown in Figure 11j, when adopting the optimal trajectory-planning method based on the cubic polynomial, the trajectory of the vehicle was prone to oscillation. There was a medium mean lateral error when adopting the end-to-end trajectory-planning method, while there was a minimum mean lateral error when adopting the method based on MTTP/DDPG. our proposed method showed a periodic and continuous variation. The experimental results verified that the model transfer trajectory-planning method based on deep reinforcement learning performs well, and the control action is continuous when driving nearly straight. As shown in Figure 11j, when adopting the optimal trajectory-planning method based on the cubic polynomial, the trajectory of the vehicle was prone to oscillation. There was a medium mean lateral error when adopting the end-toend trajectory-planning method, while there was a minimum mean lateral error when adopting the method based on MTTP/DDPG. our proposed method showed a periodic and continuous variation. The experimental results verified that the model transfer trajectory-planning method based on deep reinforcement learning performs well, and the control action is continuous when driving nearly straight. As shown in Figure 11j, when adopting the optimal trajectory-planning method based on the cubic polynomial, the trajectory of the vehicle was prone to oscillation. There was a medium mean lateral error when adopting the end-toend trajectory-planning method, while there was a minimum mean lateral error when adopting the method based on MTTP/DDPG.   Figure 12 is the schematic diagram of the S ramp. The intelligent vehicle was initially situated in the left lane of the lower right corner. After driving at a yaw angle from the north, the vehicle turned left and entered a 45 • ramp. After driving a distance, it turned right in the middle lane and completed the intelligent driving maneuver. The experiment results are shown in Figure 13. Figure 13a-c are the experimental results based on the cubic polynomial dynamic optimal trajectory planning. Figure 13d-f are the experimental results of the end-to-end trajectory planning. Figure 13g-i are the experimental results of the model transfer trajectory-planning method based on deep reinforcement learning. Figure 13j shows the lateral error from the center of the lane at each time point. The abscissas of Figure 13a,d,g are "X/m", and the ordinates are "Y/m". The blue dotted line in Figure 13a represents the "expected path" and the red solid line represents the "actual trajectory". Figure 13b,e,h have abscissas of "time/0.1 s" and ordinates of "heading angle/rad". The abscissas of Figure 13c,f,i are "time/0.1 s", and their ordinates are "steering-wheel angle/ • ". Figure 13j has an abscissa of "time point" and an ordinate of "lateral error/m".

S-type Ramp Scene
According to the comparison results between actual trajectory and desired path based on the three planning methods, the results show that the proposed method allowed the intelligent vehicle to maintain a better turning performance, especially in the combination of straight and curved roads. Finally, the lateral deviation of our proposed method was the least, and the it outperformed the other two methods. When entering the ramp or leaving the ramp, there was a greater lateral error. In general, the average lateral error adopting the MTTP/DDPG method was the minimum.  Figure 12 is the schematic diagram of the S ramp. The intelligent vehicle was initially situated in the left lane of the lower right corner. After driving at a yaw angle from the north, the vehicle turned left and entered a 45° ramp. After driving a distance, it turned right in the middle lane and completed the intelligent driving maneuver. The experiment results are shown in Figure 13. Figure 13a-c are the experimental results based on the cubic polynomial dynamic optimal trajectory planning. Figure  13d-f are the experimental results of the end-to-end trajectory planning. Figure 13g-i are the experimental results of the model transfer trajectory-planning method based on deep reinforcement learning. Figure 13j shows the lateral error from the center of the lane at each time point. The abscissas of Figure 13a,d,g are "X/m", and the ordinates are "Y/m". The blue dotted line in Figure 13(a) represents the "expected path" and the red solid line represents the "actual trajectory". Figure 13b,e,h have abscissas of "time/0.1 s" and ordinates of "heading angle/rad". The abscissas of Figure 13c,f,i are "time/0.1 s", and their ordinates are "steering-wheel angle/°". Figure 13j has an abscissa of "time point" and an ordinate of "lateral error/m".

S-type Ramp Scene
According to the comparison results between actual trajectory and desired path based on the three planning methods, the results show that the proposed method allowed the intelligent vehicle to maintain a better turning performance, especially in the combination of straight and curved roads. Finally, the lateral deviation of our proposed method was the least, and the it outperformed the other two methods. When entering the ramp or leaving the ramp, there was a greater lateral error. In general, the average lateral error adopting the MTTP/DDPG method was the minimum.   Figure 12 is the schematic diagram of the S ramp. The intelligent vehicle was initially situated in the left lane of the lower right corner. After driving at a yaw angle from the north, the vehicle turned left and entered a 45° ramp. After driving a distance, it turned right in the middle lane and completed the intelligent driving maneuver. The experiment results are shown in Figure 13. Figure 13a-c are the experimental results based on the cubic polynomial dynamic optimal trajectory planning. Figure  13d-f are the experimental results of the end-to-end trajectory planning. Figure 13g-i are the experimental results of the model transfer trajectory-planning method based on deep reinforcement learning. Figure 13j shows the lateral error from the center of the lane at each time point. The abscissas of Figure 13a,d,g are "X/m", and the ordinates are "Y/m". The blue dotted line in Figure 13(a) represents the "expected path" and the red solid line represents the "actual trajectory". Figure 13b,e,h have abscissas of "time/0.1 s" and ordinates of "heading angle/rad". The abscissas of Figure 13c,f,i are "time/0.1 s", and their ordinates are "steering-wheel angle/°". Figure 13j has an abscissa of "time point" and an ordinate of "lateral error/m".

S-type Ramp Scene
According to the comparison results between actual trajectory and desired path based on the three planning methods, the results show that the proposed method allowed the intelligent vehicle to maintain a better turning performance, especially in the combination of straight and curved roads. Finally, the lateral deviation of our proposed method was the least, and the it outperformed the other two methods. When entering the ramp or leaving the ramp, there was a greater lateral error. In general, the average lateral error adopting the MTTP/DDPG method was the minimum.

Real Vehicle Verification of Dynamic Trajectory Planning
The vehicle's real-time position was collected using the GPS (Global Positioning System) and IMU (Inertial Measurement Unit). The sensory data of the surroundings were acquired using a camera, lidar, and millimeter-wave radar, which were mounted on the driverless vehicle.
The camera was adopted to detect lane lines. The identification of obstacles depended mainly on the radar and lidar. The fusion process of obstacle information detected by multiple sensors was as follows: firstly, each sensor extracted obstacles accordingly. Then, the Hungarian algorithm (HA) [34] was adopted to match each object extracted by each sensor. Finally, the matched data were fused with the Kalman filter (KF) [35] to get the fusion data.
The lane's centerline was taken as the desired path based on the lane line detected by the camera. Different points on the desired path were taken as target P . After the model transfer strategy, a set of trajectories was obtained.

Real Vehicle Verification of Dynamic Trajectory Planning
The vehicle's real-time position was collected using the GPS (Global Positioning System) and IMU (Inertial Measurement Unit). The sensory data of the surroundings were acquired using a camera, lidar, and millimeter-wave radar, which were mounted on the driverless vehicle.
The camera was adopted to detect lane lines. The identification of obstacles depended mainly on the radar and lidar. The fusion process of obstacle information detected by multiple sensors was as follows: firstly, each sensor extracted obstacles accordingly. Then, the Hungarian algorithm (HA) [34] was adopted to match each object extracted by each sensor. Finally, the matched data were fused with the Kalman filter (KF) [35] to get the fusion data.
The lane's centerline was taken as the desired path based on the lane line detected by the camera. Different points on the desired path were taken as P target . After the model transfer strategy, a set of trajectories was obtained. Considering the constraints of the fused obstacle information, the optimal trajectory with no collisions was selected according to the cost function. Figure 14 shows the actual vehicle's dynamic trajectory planning process. Figure 14a-i were captured when the vehicle was turning in different scenes. Here, Figure 14a-f are the outside view, while Figure 14g-i are the inside view of the driverless vehicle. Figure 14 shows that the driverless performance of the intelligent vehicle was stable and efficient when turning around 90 • . Figure 14j Figure 14r is the tracking result when the actual vehicle verified the curve driving.
[34] was adopted to match each object extracted by each sensor. Finally, the matched data were fused with the Kalman filter (KF) [35] to get the fusion data.
The lane's centerline was taken as the desired path based on the lane line detected by the camera. Different points on the desired path were taken as target P . After the model transfer strategy, a set of trajectories was obtained.  Figure 14 shows the actual vehicle's dynamic trajectory planning process. Figure 14a-i were captured when the vehicle was turning in different scenes. Here, Figure 14a-f are the outside view, while Figure 14g-i are the inside view of the driverless vehicle. Figure 14 shows that the driverless performance of the intelligent vehicle was stable and efficient when turning around 90°. Figure 14j-

Conclusions
It is difficult for traditional trajectory-planning method to eliminate errors due to vehicle models and road conditions. Furthermore, there are no vehicle dynamics constraints. A model transfer trajectory-planning method based on deep reinforcement learning was proposed in this paper. At first, the complex real environment was abstracted using MTTP, then the abstracted model was transferred into a simple virtual environment through the transfer model. Secondly, the optimal intelligent driving maneuver after deep reinforcement learning training was applied to obtain the optimal control-trajectory sequence in the virtual environment. Thereby, the end-to-end trajectory planning of the intelligent vehicle in a real environment was realized. Moreover, an evaluation function was designed to estimate the planning validity of the control-trajectory sequences, and to judge the risk of collision in a real environment. Furthermore, an optimal control-trajectory sequence was decided and executed by the intelligent land vehicle. Finally, the comparison analysis of multiple driving scenes and multiple trajectory-planning methods verified the better optimization performance of MTTP, showing that it achieved a more continuous rotation-angle control sequence and a smaller lateral error for the intelligent land vehicle. However, the speed of the vehicles was assumed as a constant, and that they were driving in a typical structured environment. Because the virtual simulation environment in this paper did not consider turning around, the proposed method is not applicable when the actual road curvature is very large. The next stage will be to further consider variable-speed driving and more complex environments.